Colombia COVID-19 - Central region
Our project
We decided to do central Colombia, basically because it is where the capital is.
We built a model for the number of confirmed cases using all the others covariates (plus some we created) and we estimated the predictive accuracy of our selected model.
We decided to consider as central Colombia the following departments/districts: Bogotà DC, Boyacá, Tolima, Cundinamarca, Meta, Quindío, Valle del Cauca, Risaralda, Celdas, Boyacá, Antioquia, Santander, Casanare.
Loading the dataset
This dataset is missing completely the department Valle del Cauca!
colombia_covid <- as.data.frame(read_csv("data/covid19co.csv"))
cols <- colnames(colombia_covid)[c(1, 4, 5, 6, 7, 8, 9, 11, 14)]
colombia_covid <- colombia_covid[cols]
colombia_covid <- colombia_covid[, c(1, 9, 2, 3, 4, 5, 6, 7, 8)]
colnames(colombia_covid) <- c("ID de caso", "Fecha de diagnóstico", "Ciudad de ubicación", "Departamento o Distrito", "Atención", "Edad" , "Sexo", "Tipo", "País de procedencia")
#colombia_covid$`Departamento o Distrito`[which(colombia_covid$`Departamento o Distrito` == "Valle Del Cauca")] <- "Valle del Cauca"
central.colombia.dep <- c("Bogotá D.C.", "Tolima", "Cundinamarca", "Meta", "Boyacá", "Quindío", "Cauca",
"Valle del Cauca", "Risaralda", "Caldas", "Boyacá", "Antioquia", "Santander", "Casanare")
central.colombia.rows <- which(colombia_covid$`Departamento o Distrito` %in% central.colombia.dep)
colombia_covid <- colombia_covid[central.colombia.rows, ]
colombia_covid <- colombia_covid[-which(colombia_covid$`Fecha de diagnóstico`== "-"), ]
head(colombia_covid, 5)## ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito
## 1 1 06/03/2020 Bogotá D.C. Bogotá D.C.
## 3 3 09/03/2020 Medellín Antioquia
## 4 4 11/03/2020 Medellín Antioquia
## 5 5 11/03/2020 Medellín Antioquia
## 6 6 11/03/2020 Itagüí Antioquia
## Atención Edad Sexo Tipo País de procedencia
## 1 Recuperado 19 F Importado Italia
## 3 Recuperado 50 F Importado España
## 4 Recuperado 55 M Relacionado Nan
## 5 Recuperado 25 M Relacionado Nan
## 6 Recuperado 27 F Relacionado Nan
Description of variables
ID de caso: ID of the confirmed case.
Fecha de diagnóstico: Date in which the disease was diagnosed.
Ciudad de ubicación: City where the case was diagnosed.
Departamento o Distrito: Department or district where the city belongs to.
Atención: Situation of the patient: recovered, at home, at the hospital, at the ICU or deceased.
Edad: Age of the confirmed case.
Sexo: Sex of the confirmed case.
Tipo: How the person got infected: in Colombia, abroad or unknown.
País de procedencia: Country of origin if the person got infected abroad.
Map
Here we can see our selected cities. The color of the pins is related with the number of cases: if they are less than \(10\) the color is “green”, if they are less than \(100\) the color is “orange”, otherwise it is “red”.
Preprocessing
We had to clean the dataset:
We transformed the
Fecha de diagnósticovariable into aDatetype variable,we fixed the variable
Id de caso(since we removed some departments, so some lines, the numbers weren’t consecutive),we created a variable
Grupo de edad,we cleaned the column
País de procedencia(replaced cities with the country) and created the variableContinente de procedencia(as the first is too fragmented we thought to consider the continents).
## ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito
## 1 1 2020-03-06 Bogotá D.C. Bogotá D.C.
## 2 2 2020-03-09 Medellín Antioquia
## 3 3 2020-03-11 Medellín Antioquia
## 4 4 2020-03-11 Medellín Antioquia
## 5 5 2020-03-11 Itagüí Antioquia
## 6 6 2020-03-11 Bogotá D.C. Bogotá D.C.
## 7 7 2020-03-11 Bogotá D.C. Bogotá D.C.
## 8 8 2020-03-12 Bogotá D.C. Bogotá D.C.
## 9 9 2020-03-12 Bogotá D.C. Bogotá D.C.
## 10 10 2020-03-13 Villavicencio Meta
## Atención Edad Sexo Tipo País de procedencia Grupo de edad
## 1 Recuperado 19 F Importado Italia 19_30
## 2 Recuperado 50 F Importado España 46_60
## 3 Recuperado 55 M Relacionado Nan 46_60
## 4 Recuperado 25 M Relacionado Nan 19_30
## 5 Recuperado 27 F Relacionado Nan 19_30
## 6 Recuperado 22 F Importado España 19_30
## 7 Recuperado 28 F Importado España 19_30
## 8 Recuperado 36 F Importado España 31_45
## 9 Recuperado 42 F Importado España 31_45
## 10 Recuperado 30 F Importado España 19_30
New datasets
Cases
## Date Elapsed time New cases/day Cumulative cases
## 1 2020-03-06 0 1 1
## 2 2020-03-09 3 1 2
## 3 2020-03-11 5 5 7
## 4 2020-03-12 6 2 9
## 5 2020-03-13 7 3 12
## 6 2020-03-14 8 14 26
## 7 2020-03-15 9 13 39
## 8 2020-03-16 10 8 47
## 9 2020-03-17 11 13 60
## 10 2020-03-18 12 9 69
Cases per department
## Date Elapsed time Department Department ID New cases/day
## 1 2020-03-09 3 Antioquia 1 1
## 2 2020-03-11 5 Antioquia 1 3
## 3 2020-03-14 8 Antioquia 1 3
## 4 2020-03-15 9 Antioquia 1 1
## 5 2020-03-19 13 Antioquia 1 3
## 6 2020-03-20 14 Antioquia 1 11
## 7 2020-03-21 15 Antioquia 1 3
## 8 2020-03-22 16 Antioquia 1 5
## 9 2020-03-23 17 Antioquia 1 22
## 10 2020-03-25 19 Antioquia 1 8
## Cumulative cases/Department Mean age
## 1 1 50.00000
## 2 4 35.66667
## 3 7 30.00000
## 4 8 55.00000
## 5 11 52.33333
## 6 22 39.81818
## 7 25 31.00000
## 8 30 45.40000
## 9 52 36.45455
## 10 60 29.75000
Exploring the dataset
Scattered infos about pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):
the quarantine started on the 20th of March, since our data are from 6th of March to 2nd of April, it is very likeliy that quarantine effects are not witnessed in our data.
on March the 26th there was a damage in the machine that prepared the samples for processing and subsequent diagnosis of COVID-19, which affected the speed at which results were being produced. This could explain the very low number of confirmed cases.
## List of 3
## $ axis.line :List of 6
## ..$ colour : NULL
## ..$ size : NULL
## ..$ linetype : NULL
## ..$ lineend : NULL
## ..$ arrow : logi FALSE
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_line" "element"
## $ panel.background: list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ panel.grid : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
The previous plot represents the daily incidence of the desease across all the departments we are taking into account.
Frequency of cases across Departments
The major number of cases are in the capital Bogotà.
Cumulative number of confirmed cases
Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).
Variables plots
Spread of the desease across genders
brks <- seq(-250, 250, 50)
lbls <- as.character(c(seq(-250, 0, 50), seq(50, 250, 50)))
ggplot(data=colombia_covid, aes(x=`Departamento o Distrito`, fill = Sexo)) +
geom_bar(data = subset(colombia_covid, Sexo == "F")) +
geom_bar(data = subset(colombia_covid, Sexo == "M"), aes(y=..count..*(-1))) +
scale_y_continuous(breaks = brks,
labels = lbls) +
coord_flip() +
labs(title="Spread of the desease across genders",
y = "Number of cases",
x = "Department",
fill = "Gender") +
theme_tufte() +
theme(plot.title = element_text(hjust = .5),
axis.ticks = element_blank(),
axis.text.x = element_blank()) +
scale_fill_brewer(palette = "Dark3") The desease (number of cases) is more or less equally distributed across genders.
Distribution of the desease across ages
#compute percentage so that we can label more precisely the pie chart
age_groups_pie <- colombia_covid %>%
group_by(`Grupo de edad`) %>%
count() %>%
ungroup() %>%
mutate(per=`n`/sum(`n`)) %>%
arrange(desc(`Grupo de edad`))
age_groups_pie$label <- scales::percent(age_groups_pie$per)
age_pie <- ggplot(age_groups_pie, aes(x = "", y = per, fill = factor(`Grupo de edad`))) +
geom_bar(stat="identity", width = 1) +
theme(axis.line = element_blank(),
plot.title = element_text(hjust=0.5)) +
labs(fill="Age groups",
x=NULL,
y=NULL,
title="Distribution of the desease across ages") +
coord_polar(theta = "y") +
#geom_text(aes(x=1, y = cumsum(per) - per/2, label=label))
geom_label_repel(aes(x=1, y=cumsum(per) - per/2, label=label), size=3, show.legend = F, nudge_x = 0) +
guides(fill = guide_legend(title = "Group"))
age_pie People from 31 to 45 years old are the most affected by the disease and people over 76 years old are the least affected. Colombia is a very young country. In 2018 the median age of the population was 30.4 years old and the largest age group is made of people from 25 to 54 years old, which comprises 41.98% of the population. (https://www.indexmundi.com/colombia/demographics_profile.html)
Age-Sex plot
There isn’t much difference between the sexes among the different group of ages.
Daily number per Tipo
theme_set(theme_classic())
ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
scale_fill_brewer(palette = "Set3") +
geom_histogram(aes(fill=Tipo), width = 0.8, stat="count") +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title = "Daily number of confirmed cases",
subtitle = "subdivided across type",
x = "Date of confirmation",
fill = "Type")Tipo plot
I think that en estudio means that it is not clear while the case is imported or not, however it seems like there are more imported cases, we can count them:
type_pie <- colombia_covid %>%
group_by(Tipo) %>%
count() %>%
ungroup() %>%
mutate(per=`n`/sum(`n`)) %>%
arrange(desc(Tipo))
type_pie$label <- scales::percent(type_pie$per)
type_pie<-type_pie[names(type_pie)!="per"]
colnames(type_pie)<-c("Tipo", "Total number", "Percentage")
type_pie## # A tibble: 3 x 3
## Tipo `Total number` Percentage
## <chr> <int> <chr>
## 1 Relacionado 8570 12%
## 2 Importado 687 1%
## 3 En Estudio 60406 87%
The majority of the cases in the country are people that got infected inside Colombia. Then, people that contracted the disease abroad came mainly from Europe, followed by North America and Central America.
Correlation between the categorical variables
We used Cramer’s V to calculate the correlation between our categorical variables.
The frequentist approach
Train/test split
We splitted the data so to leave out the last three points for prediction, because we have few points and because in this models it has no sense to predict a week, because the situation changes really fast.
GLM
Poisson
Poisson with Elapsed time as predictor for Cumulative cases
## [1] "Estimated overdispersion 126.060786858437"
## [1] "RMSE: 1760.50922940986"
## [1] "AIC: 21915.0291925917"
## [1] "Null deviance: 1740689.09" "Residual deviance: 20716.39"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.158730158730159"
Prediction interval for test set
Poisson with Elapsed time as predictor for New cases/day
## [1] "Estimated overdispersion 50.354334425333"
## [1] "RMSE: 58886.3428634314"
## [1] "AIC: 7291.77457395168"
## [1] "Null deviance: 64369.08" "Residual deviance: 6442.53"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0"
Prediction interval for test set
Poisson with Elapsed time plus Elapsed time^2 as predictor for Cumulative cases
## [1] "Estimated overdispersion 75.2646213768741"
## [1] "RMSE: 6940.37335779742"
## [1] "AIC: 12062.7254007421"
## [1] "Null deviance: 1740689.09" "Residual deviance: 10862.08"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0"
Prediction interval for test set
Poisson with Elapsed time plus Elapsed time^2 as predictor for New cases/day
poisson2bis <- glm(`New cases/day` ~ `Elapsed time` + I(`Elapsed time`^2),
data=data1[1:120, ], family=poisson)## [1] "Estimated overdispersion 292693.098510483"
## [1] "RMSE: 739.356869793342"
## [1] "AIC: 7291.99073697742"
## [1] "Null deviance: 64369.08" "Residual deviance: 6440.75"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0"
Prediction interval for test set
Poisson with Elapsed time plus Sex for Cumulative cases
## [1] "RMSE: 3892.80826581986"
## [1] "AIC: 20887.674847589"
## [1] "Null deviance: 1740689.09" "Residual deviance: 19687.03"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.166666666666667"
Prediction interval for test set
Poisson with Elapsed time plus Sex as predictor for New cases/day
## [1] "Estimated overdispersion 314252.241298368"
## [1] "RMSE: 1810.41842007254"
## [1] "AIC: 4079.27639935841"
## [1] "Null deviance: 64369.08" "Residual deviance: 3228.03"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0"
Prediction interval for test set
Poisson with Elapsed time plus Age group for Cumulative cases
poisson4 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` +
`Grupo de edad_31_45` + `Grupo de edad_46_60` +
`Grupo de edad_60_75` + `Grupo de edad_76+`,
data=data1[1:120, ], family=poisson)## [1] "Estimated overdispersion 439618.33736315"
## [1] "RMSE: 3249.31716470375"
## [1] "AIC: 20650.0253613547"
## [1] "Null deviance: 1740689.09" "Residual deviance: 19441.38"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.126984126984127"
Prediction interval for test set
Poisson with Elapsed time plus Age group as predictor for New cases/day
poisson4bis <- glm(`New cases/day` ~ `Elapsed time` + `Grupo de edad_19_30` +
`Grupo de edad_31_45` + `Grupo de edad_46_60` +
`Grupo de edad_60_75` + `Grupo de edad_76+`,
data=data1[1:120, ], family=poisson)## [1] "Estimated overdispersion 315886.748769649"
## [1] "RMSE: 1948.8032473443"
## [1] "AIC: 3845.15073713936"
## [1] "Null deviance: 64369.08" "Residual deviance: 2985.91"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0"
Prediction interval for test set
Poisson with Elapsed time plus Department for Cumulative cases
poisson5 <- glm(`Cumulative cases` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ], family=poisson)## [1] "Estimated overdispersion 489833.895411907"
## [1] "RMSE: 3328.87135693358"
## [1] "AIC: 19275.3910967978"
## [1] "Null deviance: 1740689.09" "Residual deviance: 18054.75"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.0793650793650794"
Prediction interval for test set
Poisson with Elapsed time plus Department as predictor for New cases/day
poisson5bis <- glm(`New cases/day` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ], family=poisson)## [1] "Estimated overdispersion 354075.219540599"
## [1] "RMSE: 1447.54763192187"
## [1] "AIC: 2989.46506265985"
## [1] "Null deviance: 64369.08" "Residual deviance: 2118.22"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0"
Prediction interval for test set
Poisson with Elapsed time, Age group and Departments as predictors for Cumulative cases
poisson6 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` +
`Grupo de edad_46_60` + `Grupo de edad_60_75` +`Grupo de edad_76+` +
`Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` +
`Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` +
`Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` +
`Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` +
`Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` +
`Departamento o Distrito_Tolima`, data=data1[1:120, ], family=poisson)## [1] "Estimated overdispersion 547354.078180419"
## [1] "RMSE: 8243.70282563776"
## [1] "AIC: 18085.2709276108"
## [1] "Null deviance: 1740689.09" "Residual deviance: 16854.63"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.119047619047619"
Prediction interval for test set
Poisson with Elapsed time, Age group and Departments as predictors for New cases/day
poisson6bis <- glm(`New cases/day` ~ `Elapsed time` + `Grupo de edad_19_30` +
`Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
`Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ], family=poisson)## [1] "Estimated overdispersion 6719176.07903642"
## [1] "RMSE: 1225.26041309972"
## [1] "AIC: 2860.38819023372"
## [1] "Null deviance: 64369.08" "Residual deviance: 1979.14"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.444444444444444"
Prediction interval for test set
Poisson with Elapsed time, Elapsed time^2, Age group and Departments as predictors for Cumulative cases
poisson7 <- glm(`Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2) +`Grupo de edad_19_30` +
`Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
`Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ], family=poisson)## [1] "Estimated overdispersion 1238590.22360691"
## [1] "RMSE: 8441.28307878747"
## [1] "AIC: 8781.84768119021"
## [1] "Null deviance: 1740689.09" "Residual deviance: 7549.2"
Predictive accuracy
## [1] "Frequency of coverage: 0.0416666666666667"
Prediction interval for test set
Poisson with Elapsed time, Elapsed time^2, Age group and Departments as predictors for New cases/day
poisson7bis <- glm(`New cases/day` ~ `Elapsed time` + I(`Elapsed time`^2) +`Grupo de edad_19_30` +
`Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
`Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ], family=poisson)## [1] "Estimated overdispersion 14212701.7733774"
## [1] "RMSE: 3072.65556331751"
## [1] "AIC: 2225.47391742674"
## [1] "Null deviance: 64369.08" "Residual deviance: 1342.23"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.46031746031746"
Prediction interval for test set
Summary table - Poisson models for Cumulative cases
| Poisson models for cumulative cases | AIC | RMSE |
|---|---|---|
| Cumulative cases ~ Elapsed time | 21915.03 | 1760.51 |
| Cumulative cases ~ Elapsed time + Elapsed time^2 | 12062.73 | 6940.37 |
| Cumulative cases ~ Elapsed time + Sex | 20887.67 | 3892.81 |
| Cumulative cases ~ Elapsed time + Age group | 20650.03 | 3249.32 |
| Cumulative cases ~ Elapsed time + Department | 19275.39 | 3328.87 |
| Cumulative cases ~ Elapsed time + Age group + Department | 18085.27 | 8243.7 |
| Cumulative cases ~ Elapsed time + Elapsed time^2 + Age group + Department | 8781.85 | 8441.28 |
Summary table - Poisson models for New cases/day
| Poisson models for daily cases | AIC | RMSE |
|---|---|---|
| New cases/day ~ Elapsed time | 7291.77 | 58886.34 |
| New cases/day ~ Elapsed time + Elapsed time^2 | 7291.99 | 739.36 |
| New cases/day ~ Elapsed time + Sex | 4079.28 | 1810.42 |
| New cases/day ~ Elapsed time + Age group | 3845.15 | 1948.8 |
| New cases/day ~ Elapsed time + Department | 2989.47 | 1447.55 |
| New cases/day ~ Elapsed time + Age group + Department | 2860.39 | 1225.26 |
| New cases/day ~ Elapsed time + Elapsed time^2 + Age group + Department | 2225.47 | 3072.66 |
Autocorrelation to compare Poisson models
We generated 1000 samples from each of the four Poisson models and calculated the autocorrelation and compared against the autocorrelation of our original sample.
Autocorrelation for the models when the response variable is Cumulative Cases
Autocorrelation for the models when the response variable is New cases/day
ANOVA to compare the Poisson models
Response variable: Cumulative cases
## Analysis of Deviance Table
##
## Model 1: `Cumulative cases` ~ `Elapsed time`
## Model 2: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` +
## `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
## `Grupo de edad_76+`
## Model 3: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` +
## `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
## `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`
## Model 4: `Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2) + `Grupo de edad_19_30` +
## `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
## `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 118 20716.4
## 2 113 19441.4 5 1275.0 < 2.2e-16 ***
## 3 102 16854.6 11 2586.8 < 2.2e-16 ***
## 4 101 7549.2 1 9305.4 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response variable: New cases/day
## Analysis of Deviance Table
##
## Model 1: `New cases/day` ~ `Elapsed time`
## Model 2: `New cases/day` ~ `Elapsed time` + I(`Elapsed time`^2)
## Model 3: `New cases/day` ~ `Elapsed time` + I(`Elapsed time`^2) + `Grupo de edad_19_30` +
## `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
## `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 118 6442.5
## 2 117 6440.7 1 1.8 0.1817
## 3 101 1342.2 16 5098.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Quasi-Poisson
Quasi Poisson with Elapsed time as predictor for Cumulative cases
## [1] "Estimated overdispersion 3065.63883645847"
## [1] "RMSE: 1760.50922940986"
## [1] "Null deviance: 1740689.09" "Residual deviance: 20716.39"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.849206349206349"
Quasi Poisson with Elapsed time as predictors for New cases/day
## [1] "Estimated overdispersion 7743403.90238763"
## [1] "RMSE: 753.966239397641"
## [1] "Null deviance: 64369.08" "Residual deviance: 6442.53"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.183333333333333"
Quasi Poisson with Elapsed time and Elapsed time^2 as predictor for Cumulative cases
quasipoisson2 <- glm(`Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2),
data=data1[1:120, ], family=quasipoisson)## [1] "Estimated overdispersion 10518.5904954718"
## [1] "RMSE: 6940.37335779742"
## [1] "Null deviance: 1740689.09" "Residual deviance: 10862.08"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.849206349206349"
Quasi Poisson with Elapsed time and Elapsed time^2 as predictors for New cases/day
quasipoisson2bis <- glm(`New cases/day` ~ `Elapsed time` + I(`Elapsed time`^2),
data=data1[1:120, ], family=quasipoisson)## [1] "Estimated overdispersion 7530190.35400279"
## [1] "RMSE: 739.356869793342"
## [1] "Null deviance: 64369.08" "Residual deviance: 6440.75"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.25"
Quasi Poisson with Elapsed time and Sex as predictor for Cumulative cases
quasipoisson3 <- glm(`Cumulative cases` ~ `Elapsed time` + Sexo_M,
data=data1[1:120, ], family=quasipoisson)## [1] "Estimated overdispersion 3368.17928139048"
## [1] "RMSE: 3892.80826581986"
## [1] "Null deviance: 1740689.09" "Residual deviance: 19687.03"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.849206349206349"
### Quasi Poisson with Elapsed time and Sex as predictors for New cases/day |
r quasipoisson3bis <- glm(`New cases/day` ~ `Elapsed time` + Sexo_M, data=data1[1:120, ], family=quasipoisson) |
## [1] "Estimated overdispersion 4627403.99712943"
## [1] "RMSE: 1810.41842007254"
## [1] "Null deviance: 64369.08" "Residual deviance: 3228.03"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.225"
Quasi Poisson with Elapsed time and Age group as predictor for Cumulative cases
quasipoisson4 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` +
`Grupo de edad_31_45` + `Grupo de edad_46_60` +
`Grupo de edad_60_75` + `Grupo de edad_76+`,
data=data1[1:120, ], family=quasipoisson)## [1] "Estimated overdispersion 3481.12259843348"
## [1] "RMSE: 3249.31716470375"
## [1] "Null deviance: 1740689.09" "Residual deviance: 19441.38"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.865079365079365"
Quasi Poisson with Elapsed time plus Age group as predictors for New cases/day
quasipoisson4bis <- glm(`New cases/day` ~ `Elapsed time` + `Grupo de edad_19_30` +
`Grupo de edad_31_45` + `Grupo de edad_46_60` +
`Grupo de edad_60_75` + `Grupo de edad_76+`,
data=data1[1:120, ], family=quasipoisson)## [1] "Estimated overdispersion 4986548.55338898"
## [1] "RMSE: 1948.8032473443"
## [1] "Null deviance: 64369.08" "Residual deviance: 2985.91"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.308333333333333"
Quasi Poisson with Elapsed time and Department as predictor for Cumulative cases
quasipoisson5 <- glm(`Cumulative cases` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ], family=quasipoisson)## [1] "Estimated overdispersion 3576.53352183996"
## [1] "RMSE: 3328.87135693358"
## [1] "Null deviance: 1740689.09" "Residual deviance: 18054.75"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.880952380952381"
Quasi Poisson with Elapsed time plus Department as predictors for New cases/day
quasipoisson5bis <- glm(`New cases/day` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ], family=quasipoisson)## [1] "Estimated overdispersion 6172812.15892504"
## [1] "RMSE: 1447.54763192187"
## [1] "Null deviance: 64369.08" "Residual deviance: 2118.22"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.408333333333333"
Quasi Poisson with Elapsed time, Age and Department as predictor for Cumulative cases
quasipoisson6 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45`
+ `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` +
`Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` +
`Departamento o Distrito_Caldas` +`Departamento o Distrito_Casanare` +
`Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` +
`Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` +
`Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` +
`Departamento o Distrito_Tolima`,
data=data1[1:120, ], family=quasipoisson)## [1] "Estimated overdispersion 3856.05087936372"
## [1] "RMSE: 8243.70282563776"
## [1] "Null deviance: 1740689.09" "Residual deviance: 16854.63"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.873015873015873"
Quasi Poisson with Elapsed time, Age group and Department as predictors for New cases/day
quasipoisson6bis <- glm(`New cases/day` ~ `Elapsed time` + `Grupo de edad_19_30` +
`Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
`Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ], family=quasipoisson)## [1] "Estimated overdispersion 6719176.07903642"
## [1] "RMSE: 1225.26041309972"
## [1] "Null deviance: 64369.08" "Residual deviance: 1979.14"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.466666666666667"
Quasi Poisson with Elapsed time, Elapsed time^2, Age and Department as predictor for Cumulative cases
quasipoisson7 <- glm(`Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2) + `Grupo de edad_19_30`
+ `Grupo de edad_31_45`+ `Grupo de edad_46_60` + `Grupo de edad_60_75`
+ `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ], family=quasipoisson)## [1] "Estimated overdispersion 17160.3388275083"
## [1] "RMSE: 8441.28307878747"
## [1] "Null deviance: 1740689.09" "Residual deviance: 7549.2"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.896825396825397"
Quasi Poisson with Elapsed time, Elapsed time^2, Age group and Department as predictors for New cases/day
quasipoisson7bis <- glm(`New cases/day` ~ `Elapsed time` + I(`Elapsed time`^2) + `Grupo de edad_19_30` +
`Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
`Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ], family=quasipoisson)## [1] "Estimated overdispersion 14212701.7733774"
## [1] "RMSE: 3072.65556331751"
## [1] "Null deviance: 64369.08" "Residual deviance: 1342.23"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.5"
Summary table - Quasi Poisson models for Cumulative cases
| Quasi Poisson models for cumulative cases | RMSE |
|---|---|
| Cumulative cases ~ Elapsed time | 1760.51 |
| Cumulative cases ~ Elapsed time + Elapsed time^2 | 6940.37 |
| Cumulative cases ~ Elapsed time + Sex | 3892.81 |
| Cumulative cases ~ Elapsed time + Age group | 3249.32 |
| Cumulative cases ~ Elapsed time + Department | 3328.87 |
| Cumulative cases ~ Elapsed time + Department + Age group | 8243.7 |
| Cumulative cases ~ Elapsed time + Elapsed time^2 + Department + Age group | 8441.28 |
Summary table - Quasi Poisson models for New cases/day
| Quasi Poisson models for daily cases | RMSE |
|---|---|
| New cases/day ~ Elapsed time | 753.97 |
| New cases/day ~ Elapsed time + Elapsed time^2 | 739.36 |
| New cases/day ~ Elapsed time + Sex | 1810.42 |
| New cases/day ~ Elapsed time + Age group | 1948.8 |
| New cases/day ~ Elapsed time + Department | 1447.55 |
| New cases/day ~ Elapsed time + Department + Age group | 1225.26 |
| New cases/day ~ Elapsed time + Elapsed time^2 + Department + Age group | 3072.66 |
Negative Binomial
Negative Binomial with Elapsed time as predictor for Cumulative cases
## [1] "Estimated overdispersion 177.301055415717"
## [1] "RMSE: 1765.66109007326"
## [1] "AIC: 21911.1770465374"
## [1] "Null deviance: 1738722.15" "Residual deviance: 20710.41"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.158730158730159"
Prediction interval for test set
Negative Binomial with Elapsed time as pedictor for New cases/day
## [1] "Estimated overdispersion 1.18509200751555"
## [1] "RMSE: 630.336144980397"
## [1] "AIC: 1445.58057844612"
## [1] "Null deviance: 916.37" "Residual deviance: 136.75"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.944444444444444"
Negative Binomial with Elapsed time and Elapsed time^2 as predictors for Cumulative cases
## [1] "Estimated overdispersion 95.2440152571513"
## [1] "RMSE: 6946.45324195011"
## [1] "AIC: 12059.2678062916"
## [1] "Null deviance: 1739240.05" "Residual deviance: 10856.53"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.134920634920635"
Prediction interval for test set
Negative Binomial with Elapsed time and Elapsed time^2 as pedictors for New cases/day
## [1] "Estimated overdispersion 1.1928471628252"
## [1] "RMSE: 1196.18328341049"
## [1] "AIC: 1435.02770364321"
## [1] "Null deviance: 1000.61" "Residual deviance: 134.83"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.920634920634921"
Negative Binomial with Elapsed time and Sex as predictors for Cumulative cases
## [1] "Estimated overdispersion 171.54729431634"
## [1] "RMSE: 3898.50452450758"
## [1] "AIC: 20882.9362467527"
## [1] "Null deviance: 1738961.22" "Residual deviance: 19680.18"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.166666666666667"
Prediction interval for test set
Negative Binomial with Elapsed time and Sex as pedictors for New cases/day
## [1] "Estimated overdispersion 1.22488855750543"
## [1] "RMSE: 1182.57407376748"
## [1] "AIC: 1441.87950405362"
## [1] "Null deviance: 979.37" "Residual deviance: 139.09"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.928571428571429"
Negative Binomial with Elapsed time plus Age group as predictors for Cumulative cases
nb4 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30`+
`Grupo de edad_31_45` + `Grupo de edad_46_60` +
`Grupo de edad_60_75` + `Grupo de edad_76+`,
data=data1[1:120, ])## [1] "Estimated overdispersion 178.172494672156"
## [1] "RMSE: 3253.44941468989"
## [1] "AIC: 20645.2695912123"
## [1] "Null deviance: 1738975.56" "Residual deviance: 19434.52"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.126984126984127"
Prediction interval for test set
Negative Binomial with Elapsed time and Age group as pedictors for New cases/day
nb4bis <- glm.nb(`New cases/day` ~ `Elapsed time` + `Grupo de edad_19_30`+
`Grupo de edad_31_45` + `Grupo de edad_46_60` +
`Grupo de edad_60_75` + `Grupo de edad_76+`,
data=data1[1:120, ])## [1] "Estimated overdispersion 1.29900880793691"
## [1] "RMSE: 1943.62684829872"
## [1] "AIC: 1444.376892835"
## [1] "Null deviance: 1027.67" "Residual deviance: 139.39"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.944444444444444"
Negative Binomial with Elapsed time plus Department as predictors for Cumulative cases
nb5 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas`+
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ])## [1] "Estimated overdispersion 181.630453098407"
## [1] "RMSE: 3335.37126416692"
## [1] "AIC: 19270.2478165627"
## [1] "Null deviance: 1739087.22" "Residual deviance: 18047.5"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.0793650793650794"
Prediction interval for test set
Negative Binomial with Elapsed time and Departments as pedictors for New cases/day
nb5bis <- glm.nb(`New cases/day` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ])## [1] "Estimated overdispersion 1.36033553203779"
## [1] "RMSE: 2185.73149313813"
## [1] "AIC: 1431.44459445255"
## [1] "Null deviance: 1334.76" "Residual deviance: 145.73"
Predictive accuracy
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.944444444444444"
Negative Binomial with Elapsed time plus Continent of origin as predictors
# nbNO <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Continente de procedencia_Asia`+`Continente de procedencia_Centroamérica`+`Continente de procedencia_Colombia`+`Continente de procedencia_Europa`+`Continente de procedencia_Norteamérica`+`Continente de procedencia_Sudamerica`, data=data1[1:22, ])
# plot(nb4, which=1)
# nb4.pred <- predict(nb4, newdata = data1[23:25, ], type = "response")
# paste("RMSE:", sqrt(mean((nb4.pred - data1$`Cumulative cases`[23:25])^2)))
# #paste("MSE:", mean(nb4$residuals^2))
# paste("AIC:", nb4$aic)
# paste(c("Null deviance: ", "Residual deviance:"),
# round(c(nb4$null.deviance, deviance(nb4)), 2))Negative Binomial with Elapsed time, Age group and Departments as pedictors
nb6 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` +
`Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` +
`Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` +
`Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` +
`Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` +
`Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` +
`Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` +
`Departamento o Distrito_Tolima`, data=data1[1:120, ])## [1] "Estimated overdispersion 185.625787051898"
## [1] "RMSE: 8253.89918410353"
## [1] "AIC: 18080.4893960953"
## [1] "Null deviance: 1739076.35" "Residual deviance: 16847.74"
Predictive accuracy of the Negative Binomial model
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.119047619047619"
Prediction interval for test set
Negative Binomial with Elapsed time, Age group and Departments as pedictors for New cases/day
nb6bis <- glm.nb(`New cases/day` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` +
`Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` +
`Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` +
`Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` +
`Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` +
`Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` +
`Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` +
`Departamento o Distrito_Tolima`, data=data1[1:120, ])## [1] "Estimated overdispersion 1.60944976076293"
## [1] "RMSE: 1379.56664915406"
## [1] "AIC: 1435.14403991019"
## [1] "Null deviance: 1416.29" "Residual deviance: 146.46"
Predictive accuracy of the NB model for New cases/day
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.936507936507937"
Negative Binomial with Elapsed time, Elapsed time^2, Age group and Departments as pedictors
nb7 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2) +`Grupo de edad_19_30` +
`Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
`Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ])## [1] "Estimated overdispersion 92.8003925252854"
## [1] "RMSE: 8445.6104892801"
## [1] "AIC: 8780.25416767597"
## [1] "Null deviance: 1739084.62" "Residual deviance: 7545.51"
Predictive accuracy of the Negative Binomial model
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.214285714285714"
Prediction interval for test set
Negative Binomial with Elapsed time, Elapsed time^2, Age group and Departments as pedictors for New cases/day
nb7bis <- glm.nb(`New cases/day` ~ `Elapsed time` + I(`Elapsed time`^2) +`Grupo de edad_19_30` +
`Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
`Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
`Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
`Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
`Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
`Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
data=data1[1:120, ])## [1] "Estimated overdispersion 1.65569181835685"
## [1] "RMSE: 9304.26999604707"
## [1] "AIC: 1376.63703081842"
## [1] "Null deviance: 2432.64" "Residual deviance: 149.28"
Predictive accuracy of the NB model for New cases/day
Predicting with a \(95\%\) confidence interval
## [1] "Frequency of coverage: 0.888888888888889"
Summary table - Negative Binomial models for Cumulative cases
| Negative Binomial models for cumulative cases | AIC | RMSE |
|---|---|---|
| Cumulative cases ~ Elapsed time | 21911.18 | 1765.66 |
| Cumulative cases ~ Elapsed time + Elapsed time^2 | 12059.27 | 6946.45 |
| Cumulative cases ~ Elapsed time + Sex | 20882.94 | 3898.5 |
| Cumulative cases ~ Elapsed time + Age group | 20645.27 | 3253.45 |
| Cumulative cases ~ Elapsed time + Department | 19270.25 | 3335.37 |
| Cumulative cases ~ Elapsed time + Age group + Department | 18080.49 | 8253.9 |
| Cumulative cases ~ Elapsed time + Elapsed time^2 + Age group + Department | 8780.25 | 8445.61 |
Summary table - Negative Binomial models for New cases/day
| Negative Binomial models for daily cases | AIC | RMSE |
|---|---|---|
| New cases/day ~ Elapsed time | 1445.58 | 630.34 |
| New cases/day ~ Elapsed time + Elapsed time^2 | 1435.03 | 1196.18 |
| New cases/day ~ Elapsed time + Sex | 1441.88 | 1182.57 |
| New cases/day ~ Elapsed time + Age group | 1444.38 | 1943.63 |
| New cases/day ~ Elapsed time + Department | 1431.44 | 2185.73 |
| New cases/day ~ Elapsed time + Age group + Department | 1435.14 | 1379.57 |
| New cases/day ~ Elapsed time + Elapsed time^2 + Age group + Department | 1376.64 | 9304.27 |
Autocorrelation to compare Negative Binomial models
We generated 1000 samples from each of the four Negative Binomial models and calculated the autocorrelation and compared against the autocorrelation of our original sample.
Autocorrelation for the models when the response variable is Cumulative Cases
Autocorrelation for the model when the response variable is New cases/day
Applying ANOVA to compare the negative binomial models
Response variable: Cumulative cases
## Likelihood ratio tests of Negative Binomial Models
##
## Response: Cumulative cases
## Model
## 1 Elapsed time
## 2 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`
## 3 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + \n `Departamento o Distrito_Tolima`
## 4 `Elapsed time` + I(`Elapsed time`^2) + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + \n `Departamento o Distrito_Tolima`
## theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
## 1 11252779 118 -21905.177
## 2 12919544 113 -20629.270 1 vs 2 5 1275.907 0
## 3 13728065 102 -18042.489 2 vs 3 11 2586.780 0
## 4 13798920 101 -8740.254 3 vs 4 1 9302.235 0
Response variable: New cases/day
## Likelihood ratio tests of Negative Binomial Models
##
## Response: New cases/day
## Model
## 1 Elapsed time
## 2 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`
## 3 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + \n `Departamento o Distrito_Tolima`
## 4 `Elapsed time` + I(`Elapsed time`^2) + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + \n `Departamento o Distrito_Tolima`
## theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
## 1 4.571965 118 -1439.581
## 2 5.156688 113 -1428.377 1 vs 2 5 11.20369 4.748776e-02
## 3 7.241381 102 -1397.144 2 vs 3 11 31.23285 1.011534e-03
## 4 12.990325 101 -1336.637 3 vs 4 1 60.50701 7.327472e-15
GAM
Gam with Elapsed time as covariate for Cumulative cases
## [1] "RMSE: 1282.65627040657"
## [1] "AIC: 1587.11794466718"
Prediction interval for test set
Intervals are very very very thin, we won’t include them in the following models.
Gam with Elapsed time as covariate for New cases/day
## [1] "RMSE: 1121.41502755924"
## [1] "AIC: 6361.02810119354"
Gam with Elapsed time and Elapsed time^2 as covariate for Cumulative cases
gam2 <- gam(Cumulative_cases ~ s(Elapsed_time) + I(Elapsed_time^2),
family = poisson, data = df[1:120,])## [1] "RMSE: 6047.00503603781"
## [1] "AIC: 1517.87827688163"
Gam with Elapsed time and Elapsed time^2 as covariate for New cases/day
gam2bis <- gam(New_cases_day ~ s(Elapsed_time) + I(Elapsed_time^2),
family = poisson, data = df[1:120,])## [1] "RMSE: 756.744620674939"
## [1] "AIC: 6341.56819625084"
Gam with Elapsed time and Sex as covariate for Cumulative cases
## [1] "RMSE: 2301.65489717445"
## [1] "AIC: 1525.13432986912"
Gam with Elapsed time and Sex as covariate for New cases/day
## [1] "RMSE: 1301.34999278558"
## [1] "AIC: 1239.36316647314"
Gam with Elapsed time and Age group as covariate for Cumulative cases
gam4 <- gam(Cumulative_cases ~ s(Elapsed_time) + Grupo_de_edad_19_30 +
Grupo_de_edad_31_45 + Grupo_de_edad_46_60 +
Grupo_de_edad_60_75 + Grupo_de_edad_76,
family = poisson(), data = df[1:120,])## [1] "RMSE: 911.823954533102"
## [1] "AIC: 1543.48204372495"
Gam with Elapsed time and Age group as covariate for New cases/day
gam4bis <- gam(New_cases_day ~ s(Elapsed_time) + s(Sexo_M) + Grupo_de_edad_19_30 +
Grupo_de_edad_31_45 + Grupo_de_edad_46_60 +
Grupo_de_edad_60_75 + Grupo_de_edad_76,
family = poisson(), data = df[1:120,])## [1] "RMSE: 412.270676088203"
## [1] "AIC: 1094.53096659504"
Gam with Elapsed time and Department as covariate for Cumulative cases
gam5 <- gam(Cumulative_cases ~ s(Elapsed_time) + Departamento_o_Distrito_Bogotá_D.C. +
Departamento_o_Distrito_Boyacá + Departamento_o_Distrito_Caldas +
Departamento_o_Distrito_Casanare + Departamento_o_Distrito_Cauca +
Departamento_o_Distrito_Cundinamarca + Departamento_o_Distrito_Meta +
Departamento_o_Distrito_Quindío + Departamento_o_Distrito_Risaralda +
Departamento_o_Distrito_Santander + Departamento_o_Distrito_Tolima,
family = poisson(), data = df[1:120,])##
## Method: UBRE Optimizer: outer newton
## full convergence after 12 iterations.
## Gradient range [1.464349e-06,1.464349e-06]
## (score 1.687161 & scale 1).
## Hessian positive definite, eigenvalue range [0.0004925107,0.0004925107].
## Model rank = 21 / 21
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Elapsed_time) 9.00 8.97 0.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "RMSE: 704.139095232141"
## [1] "AIC: 1517.10332972498"
Gam with Elapsed time and Department as covariate for New cases/day
gam5bis <- gam(New_cases_day~s(Elapsed_time) + s(Sexo_M) + Departamento_o_Distrito_Bogotá_D.C. +
Departamento_o_Distrito_Boyacá + Departamento_o_Distrito_Caldas +
Departamento_o_Distrito_Casanare + Departamento_o_Distrito_Cauca +
Departamento_o_Distrito_Cundinamarca + Departamento_o_Distrito_Meta +
Departamento_o_Distrito_Quindío + Departamento_o_Distrito_Risaralda +
Departamento_o_Distrito_Santander + Departamento_o_Distrito_Tolima,
family = poisson(), data = df[1:120,])##
## Method: UBRE Optimizer: outer newton
## full convergence after 11 iterations.
## Gradient range [-3.805153e-09,1.568086e-07]
## (score 1.087254 & scale 1).
## Hessian positive definite, eigenvalue range [0.0006433799,0.001348039].
## Model rank = 30 / 30
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Elapsed_time) 9.00 8.91 1.00 0.47
## s(Sexo_M) 9.00 8.96 0.89 0.12
## [1] "RMSE: 398.085987887755"
## [1] "AIC: 1095.71518740438"
Gam with Elapsed time, Age group and Department as covariate for Cumulative cases
gam6 <- gam(Cumulative_cases ~ s(Elapsed_time) + Grupo_de_edad_19_30 + Grupo_de_edad_31_45 +
Grupo_de_edad_46_60 + Grupo_de_edad_60_75 + Grupo_de_edad_76 +
Departamento_o_Distrito_Bogotá_D.C. + Departamento_o_Distrito_Boyacá +
Departamento_o_Distrito_Caldas + Departamento_o_Distrito_Casanare +
Departamento_o_Distrito_Cauca + Departamento_o_Distrito_Cundinamarca +
Departamento_o_Distrito_Meta + Departamento_o_Distrito_Quindío +
Departamento_o_Distrito_Risaralda + Departamento_o_Distrito_Santander +
Departamento_o_Distrito_Tolima, family = poisson(), data = df[1:120,])## [1] "RMSE: 1038.67068102733"
## [1] "AIC: 1486.70786137681"
Gam with Elapsed time and Department and Age group as covariate for New cases/day
gam6bis <- gam(New_cases_day ~ s(Elapsed_time) + s(Sexo_M) + Grupo_de_edad_19_30 + Grupo_de_edad_31_45 +
Grupo_de_edad_46_60 + Grupo_de_edad_60_75 + Grupo_de_edad_76 +
Departamento_o_Distrito_Bogotá_D.C. + Departamento_o_Distrito_Boyacá +
Departamento_o_Distrito_Caldas + Departamento_o_Distrito_Casanare +
Departamento_o_Distrito_Cauca + Departamento_o_Distrito_Cundinamarca +
Departamento_o_Distrito_Meta + Departamento_o_Distrito_Quindío +
Departamento_o_Distrito_Risaralda + Departamento_o_Distrito_Santander +
Departamento_o_Distrito_Tolima, family = poisson(), data = df[1:120,])## [1] "RMSE: 868.400823218395"
## [1] "AIC: 1078.02668310693"
Gam with Elapsed time, Elapsed time^2, Age group and Department as covariate for Cumulative cases
gam7 <- gam(Cumulative_cases ~ s(Elapsed_time) + s(I(Elapsed_time^2)) + Grupo_de_edad_19_30 +
Grupo_de_edad_31_45 + Grupo_de_edad_46_60 + Grupo_de_edad_60_75 + Grupo_de_edad_76 +
Departamento_o_Distrito_Bogotá_D.C. + Departamento_o_Distrito_Boyacá +
Departamento_o_Distrito_Caldas + Departamento_o_Distrito_Casanare +
Departamento_o_Distrito_Cauca + Departamento_o_Distrito_Cundinamarca +
Departamento_o_Distrito_Meta + Departamento_o_Distrito_Quindío +
Departamento_o_Distrito_Risaralda + Departamento_o_Distrito_Santander +
Departamento_o_Distrito_Tolima, family = poisson(), data = df[1:120,])## [1] "RMSE: 4599.76926057587"
## [1] "AIC: 1337.1676523026"
Gam with Elapsed time, Elapsed time^2, Age group and Department as covariate for New cases/day
gam7bis <- gam(New_cases_day ~ s(Elapsed_time) + s(I(Elapsed_time^2)) + s(Sexo_M) + Grupo_de_edad_19_30 +
Grupo_de_edad_31_45 + Grupo_de_edad_46_60 + Grupo_de_edad_60_75 + Grupo_de_edad_76 +
Departamento_o_Distrito_Bogotá_D.C. + Departamento_o_Distrito_Boyacá +
Departamento_o_Distrito_Caldas + Departamento_o_Distrito_Casanare +
Departamento_o_Distrito_Cauca + Departamento_o_Distrito_Cundinamarca +
Departamento_o_Distrito_Meta + Departamento_o_Distrito_Quindío +
Departamento_o_Distrito_Risaralda + Departamento_o_Distrito_Santander +
Departamento_o_Distrito_Tolima, family = poisson(), data = df[1:120,])## [1] "RMSE: 1237.68954630396"
## [1] "AIC: 1077.19400426248"
Summary table - GAM models for Cumulative cases
| GAM models for daily cases | AIC | RMSE |
|---|---|---|
| Cumulative cases ~ s(Elapsed time) | 1587.12 | 1282.66 |
| Cumulative cases ~ s(Elapsed time) + s(Elapsed time) ^2) | 1517.88 | 6047.01 |
| Cumulative cases ~ s(Elapsed time) + s(Sex) | 1525.13 | 2301.65 |
| Cumulative cases ~ s(Elapsed time) + Age group | 1543.48 | 911.82 |
| Cumulative cases ~ s(Elapsed time) + Department | 1517.1 | 704.14 |
| Cumulative cases ~ s(Elapsed time) + Age group + Department | 1486.71 | 1038.67 |
| Cumulative cases ~ s(Elapsed time) + s(Elapsed time^2) + Age group + Department | 1337.17 | 4599.77 |
Summary table - GAM models for New cases/day
| GAM models for daily cases | AIC | RMSE |
|---|---|---|
| New cases/day ~ s(Elapsed time) | 6361.03 | 1121.42 |
| New cases/day ~ s(Elapsed time) + s(Elapsed time^2) | 6341.57 | 756.74 |
| New cases/day ~ s(Elapsed time) + s(Sex) | 1239.36 | 1301.35 |
| New cases/day ~ s(Elapsed time) + s(Sex) + Age group | 1094.53 | 412.27 |
| New cases/day ~ s(Elapsed time) + s(Sex) + Department | 1095.72 | 398.09 |
| New cases/day ~ s(Elapsed time) + s(Sex) + Age group + Department | 1078.03 | 868.4 |
| New cases/day ~ s(Elapsed time) + s(Elapsed time^2) + s(Sex) + Age group + Department | 1077.19 | 1237.69 |
ANOVA to compare the GAM models
Response variable: Cumulative cases
## Analysis of Deviance Table
##
## Model 1: Cumulative_cases ~ s(Elapsed_time)
## Model 2: Cumulative_cases ~ s(Elapsed_time) + Departamento_o_Distrito_Bogotá_D.C. +
## Departamento_o_Distrito_Boyacá + Departamento_o_Distrito_Caldas +
## Departamento_o_Distrito_Casanare + Departamento_o_Distrito_Cauca +
## Departamento_o_Distrito_Cundinamarca + Departamento_o_Distrito_Meta +
## Departamento_o_Distrito_Quindío + Departamento_o_Distrito_Risaralda +
## Departamento_o_Distrito_Santander + Departamento_o_Distrito_Tolima
## Model 3: Cumulative_cases ~ s(Elapsed_time) + Grupo_de_edad_19_30 + Grupo_de_edad_31_45 +
## Grupo_de_edad_46_60 + Grupo_de_edad_60_75 + Grupo_de_edad_76 +
## Departamento_o_Distrito_Bogotá_D.C. + Departamento_o_Distrito_Boyacá +
## Departamento_o_Distrito_Caldas + Departamento_o_Distrito_Casanare +
## Departamento_o_Distrito_Cauca + Departamento_o_Distrito_Cundinamarca +
## Departamento_o_Distrito_Meta + Departamento_o_Distrito_Quindío +
## Departamento_o_Distrito_Risaralda + Departamento_o_Distrito_Santander +
## Departamento_o_Distrito_Tolima
## Model 4: Cumulative_cases ~ s(Elapsed_time) + s(I(Elapsed_time^2)) + Grupo_de_edad_19_30 +
## Grupo_de_edad_31_45 + Grupo_de_edad_46_60 + Grupo_de_edad_60_75 +
## Grupo_de_edad_76 + Departamento_o_Distrito_Bogotá_D.C. +
## Departamento_o_Distrito_Boyacá + Departamento_o_Distrito_Caldas +
## Departamento_o_Distrito_Casanare + Departamento_o_Distrito_Cauca +
## Departamento_o_Distrito_Cundinamarca + Departamento_o_Distrito_Meta +
## Departamento_o_Distrito_Quindío + Departamento_o_Distrito_Risaralda +
## Departamento_o_Distrito_Santander + Departamento_o_Distrito_Tolima
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 110.001 372.54
## 2 99.001 280.52 11.0002 92.022 6.690e-15 ***
## 3 94.001 240.12 5.0001 40.402 1.239e-07 ***
## 4 86.005 74.68 7.9960 165.435 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response variable: New cases/day
## Analysis of Deviance Table
##
## Model 1: New_cases_day ~ s(Elapsed_time)
## Model 2: New_cases_day ~ s(Elapsed_time) + s(Sexo_M) + Grupo_de_edad_19_30 +
## Grupo_de_edad_31_45 + Grupo_de_edad_46_60 + Grupo_de_edad_60_75 +
## Grupo_de_edad_76
## Model 3: New_cases_day ~ s(Elapsed_time) + s(Sexo_M) + Grupo_de_edad_19_30 +
## Grupo_de_edad_31_45 + Grupo_de_edad_46_60 + Grupo_de_edad_60_75 +
## Grupo_de_edad_76 + Departamento_o_Distrito_Bogotá_D.C. +
## Departamento_o_Distrito_Boyacá + Departamento_o_Distrito_Caldas +
## Departamento_o_Distrito_Casanare + Departamento_o_Distrito_Cauca +
## Departamento_o_Distrito_Cundinamarca + Departamento_o_Distrito_Meta +
## Departamento_o_Distrito_Quindío + Departamento_o_Distrito_Risaralda +
## Departamento_o_Distrito_Santander + Departamento_o_Distrito_Tolima
## Model 4: New_cases_day ~ s(Elapsed_time) + s(I(Elapsed_time^2)) + s(Sexo_M) +
## Grupo_de_edad_19_30 + Grupo_de_edad_31_45 + Grupo_de_edad_46_60 +
## Grupo_de_edad_60_75 + Grupo_de_edad_76 + Departamento_o_Distrito_Bogotá_D.C. +
## Departamento_o_Distrito_Boyacá + Departamento_o_Distrito_Caldas +
## Departamento_o_Distrito_Casanare + Departamento_o_Distrito_Cauca +
## Departamento_o_Distrito_Cundinamarca + Departamento_o_Distrito_Meta +
## Departamento_o_Distrito_Quindío + Departamento_o_Distrito_Risaralda +
## Departamento_o_Distrito_Santander + Departamento_o_Distrito_Tolima
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 110.004 5496.0
## 2 96.007 201.6 13.9967 5294.4 < 2.2e-16 ***
## 3 85.007 163.1 11.0004 38.5 6.459e-05 ***
## 4 82.878 158.6 2.1287 4.5 0.1148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Bayesian approach
Poisson regression for response variable Cumulative cases/Department
As a first attempt, we fit a simple Poisson regression: \[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Poisson}(\lambda_i) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \] with \(i = 1,\dots,134\), being \(134\) the number of rows of our dataset, and \(y_i\) represents the number of cases.
For what concerns the stan program, we used the function poisson_log_rng to describe the distribution of \(y_i\), namely the number of cases each day and the function poisson_log_lpmf to specify the likelihood.
Posterior predictive check - Poisson Cumulative cases/Department
y_rep <- as.matrix(fit1, pars="y_rep")
ppc_dens_overlay(y = model.data$cases, y_rep[1:400,]) +
coord_cartesian(xlim = c(-1, 7000))The fit is not satisfactory, it is probably due to overdispersion, we can check the residuals to confirm this hypothesis.
Residual check - Poisson Cumulative cases/Department
#in this way we check the standardized residuals
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data$cases - mean_y_rep) / sqrt(mean_y_rep)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)The variance of the residuals increases as the predicted value increase. The standardized residuals should have mean 0 and standard deviation 1 (hence the lines at \(+2\) and \(-2\) indicates approximate \(95\%\) error bounds).
The plot of the standardized residuals indicates a large amount of overdispersion.
Intervals - Poisson Cumulative cases/Department
Accuracy - Poisson Cumulative cases/Department
Poisson model for New cases/day
Same model, but now \(y_i\) represents New cases/day.
Posterioir predictive check - Poisson new cases/day
y_rep <- as.matrix(fit1.1, pars="y_rep")
ppc_dens_overlay(y = model.data.cases$cases, y_rep[1:200,]) +
coord_cartesian(xlim = c(-1, 1500))Residual check - Poisson new cases/day
#in this way we check the standardized residuals
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data.cases$cases - mean_y_rep) / sqrt(mean_y_rep)
qplot(mean_y_rep, std_residual, ylim=c(-5, 10)) + hline_at(2) + hline_at(-2)Intervals - Poisson New cases/day
Accuracy - Poisson New cases/day
Negative Binomial model for Cumulative cases /Department
We try to improve the previous model using the Negative Binomial model:
\[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]
Where the parameter \(\phi\) is called precision and it is such that:
\[ E[y_i] = \lambda_i \\ Var[y_i] = \lambda_i + \frac{\lambda_i^2}{\phi} \]
again \(i=1,\dots,134\). As \(\phi \rightarrow \infty\) the negative binomial approaches the Poisson distribution.
The stan function that we use here are neg_binomial_2_log_rng to specify the distribution of \(y_i\) and the function neg_binomial_2_log_lpmf for the likelihood.
Posterior predictive check - NB Cumulative cases /Department
samples_NB <- rstan::extract(fit2)
y_rep <- samples_NB$y_rep
ppc_dens_overlay(y = model.data$cases, y_rep[1:200,]) +
coord_cartesian(xlim = c(-1, 6000))Residual check - NB Cumulative cases /Department
mean_inv_phi <- mean(samples_NB$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)The situation is better now, but still we have too many residuals outside the \(95\%\) interval.
Intervals - NB Cumulative cases /Department
ppc_intervals(
y = cases_dep$`Cumulative cases/Department`,
yrep = y_rep,
x = cases_dep$`Elapsed time`
) +
labs(x = "Days", y = "Cases")Accuracy across departments - NB Cumulative cases /Department
ppc_stat_grouped(
y = model.data$cases,
yrep = y_rep,
group = cases_dep$Department,
stat = "median",
binwidth = 0.2
)We should take into account the differences across departments.
Negative Binomial model for New cases/day
Posterior predictive check - NB New cases/day
samples_NB <- rstan::extract(fit2.1)
y_rep <- samples_NB$y_rep
ppc_dens_overlay(y = model.data.cases$cases, y_rep[1:200,]) +
coord_cartesian(xlim = c(-1, 1000))Residual check - NB New cases/day
mean_inv_phi <- mean(samples_NB$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data.cases$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)The situation is better now, but still we have too many residuals outside the \(95\%\) interval.
Multilevel Negative Binomial regression for Cumulative/Department
We try to fit the following model, which also includes Age as covariat:
\[ ln(\lambda_i) = \alpha + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age \\ y_i \sim \mathcal{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \]
Posterior predictive check - multi NB Cumulative/Department
samples_NB2 <- rstan::extract(fit3)
y_rep <- samples_NB2$y_rep
ppc_dens_overlay(y = model.data2$cases, y_rep[1:200,]) +
coord_cartesian(xlim = c(-1, 6000))Residual check - multi NB Cumulative/Department
mean_inv_phi <- mean(samples_NB2$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data2$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)Multilevel Negative Binomial Regression for New cases/day
Posterior predictive check - multi NB New cases/day
samples_NB2 <- rstan::extract(fit3.1)
y_rep <- samples_NB2$y_rep
ppc_dens_overlay(y = model.data2.cases$cases, y_rep[1:200,]) +
coord_cartesian(xlim = c(-1, 750))Residual check - multi NB New cases/day
mean_inv_phi <- mean(samples_NB2$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data2.cases$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)Hierarchical model for Cumulative/Department
In order to improve the fit, we fit a model with department-specific intercept term.
So the varying intercept model that we take into account is now:
\[ ln(\lambda_{i,d}) = \alpha_d + + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age_i\\ \alpha_d \sim \mathcal{N}(\mu + \beta_{pop}\cdot pop_d + \beta_{sur}\cdot surface_d + \beta_{dens} \cdot density_d, \sigma_{\alpha})\\ y_i \sim \mathcal{Negative Binomial}(\lambda_{i,d}, \phi) \]
The priors used for the above model are the following:
\[ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \\ \psi \sim \mathcal{N}(0,1) \]
being \(\psi = [\beta_{pop}, \beta_{sur}, \beta_{dens}]\).
New dataset
We added the following covariats into the dataset:
People: millions of inhabitants for each region;Surface: \(km^3\), extent of each region;Density: \(\frac{people}{km^2}\), density of the population in each region.
The model is:
Posterior predictive check - hierarchical NB Cumulative/Department
samples_hier <- rstan::extract(fit.4)
y_rep <- samples_hier$y_rep
ppc_dens_overlay(y = data.hier.NB.complete$cases, y_rep[1:200,]) +
coord_cartesian(xlim = c(-1, 6000))Residual check - hierarchical NB Cumulative/Department
mean_inv_phi <- mean(samples_hier$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (data.hier.NB.complete$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)Very few points are now outside the \(95\%\) confidence interval.
Intervals - hierarchical NB Cumulative/Department
ppc_intervals(
y = cases_dep$`Cumulative cases/Department`,
yrep = y_rep,
x = cases_dep$`Elapsed time`
) +
labs(x = "Days", y = "Cases")Accuracy across departments - hierarchical NB Cumulative/Department
ppc_stat_grouped(
y = data.hier.NB.complete$cases,
yrep = y_rep,
group = cases_dep$Department,
stat = "median",
binwidth = 0.2
)We can clearly see that the accuracy across the departments has significantly increased with respect to the previous models.
Hierarchical model for New cases/day
Posterior predictive check - hierarchical NB New cases/day
samples_hier <- rstan::extract(fit.4.1)
y_rep <- samples_hier$y_rep
ppc_dens_overlay(y = data.hier.NB.complete.cases$cases, y_rep[1:200,]) +
coord_cartesian(xlim = c(-1, 500))Residual check - hierarchical NB New cases/day
mean_inv_phi <- mean(samples_hier$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (data.hier.NB.complete.cases$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual, ylim = c(-3, 4)) + hline_at(2) + hline_at(-2)Very few points are now outside the \(95\%\) confidence interval.
Intervals - hierarchical NB New cases/day
ppc_intervals(
y = cases_dep$`New cases/day`,
yrep = y_rep,
x = cases_dep$`Elapsed time`
) +
labs(x = "Days", y = "Cases")Accuracy across departments - hierarchical NB New cases/day
ppc_stat_grouped(
y = data.hier.NB.complete.cases$cases,
yrep = y_rep,
group = cases_dep$Department,
stat = "median",
binwidth = 0.2
)We can clearly see that the accuracy across the departments has significantly increased with respect to the previous models.
LOOIC
The Leave-One-Out cross validation is a method for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulation of the parameters values.
Plot the looic to compare models:
loo.all.deps<-c(loo.model.Poisson.cases[3], loo.model.NB.cases[3], loo.model.NB2.cases[3], loo.model.NB.hier.cases[3])
sort.loo.all.deps<- sort.int(loo.all.deps, index.return = TRUE)$x
par(xaxt="n")
plot(sort.loo.all.deps, type="b", xlab="", ylab="LOOIC", main="Model comparison")
par(xaxt="s")
axis(1, c(1:4), c("Poisson", "NB-sl", "NB-ml",
"hier")[sort.int(loo.all.deps,
index.return = TRUE)$ix],
las=2)